Clinical statistics for non-statisticians: Day one
Start with a bad joke
Two statistics are sitting in a bar. One turns to the other and asks, “So, how do you like married life?”
The other statistic responds …
Put your reaction (“Ha ha”, “Groan”, etc.) in the chat box.
Before I begin anything important, I like to start with a silly joke. Now on Zoom, I often miss student reactions. So when I say something funny, I want you to type “Ha ha” or “Smile” or “LMFAO”. The acronym LMFAO means laughing my something … I forget how the rest of it goes.
Now if the joke is corny, like a really really bad pun, it’s okay to put “Groan”. The only thing bad is if I tell a joke and get no reaction at all.
I’ll be sneaking in some jokes throughout the talk and I really want a reaction from you, good or bad. If I don’t get any reaction to a bad pun, your “pun”ishment will be more bad puns.
So here’s the joke. It has been floating around on the Internet for quite a while, and I can’t find the person who gets credit for this. But here goes.
[Read joke and finish with] “It’s okay but you lose a degree of freedom.”
Okay, I’m waiting for reactions.
Introduction
Tell us one interesting number about yourself
Examples
8: I have traveled to eight countries outside the United States
(Canada, Italy, China, France, Russia, England, Holland, and Iceland)
29: I did not learn how to drive until I was 29 years old
1802: My highest chess rating was 1802, but I am not that good any more.
Speaker notes
I want to learn a bit about all of you, and I’m going to do this in a statistical way. Tell me one interesting number about yourself. It could be something simple, like the number of children you have or something exotic like the height of the highest mountain you have climbed.
Here are three numbers about me.
A bit more about myself
PhD in Statistics in 1982 from the University of Iowa
Currently full professor
Part-time statistical consultant
Funded on 18 research grants
Over 100 peer-reviewed publications
Website with over 2,000 pages
Many invitations to talk at conferences
I have a PhD in Statistics from the University of Iowa. I have always had a strong interest in the computational side of Statistics. My dissertation was 150 pages, and 100 of those pages were computer generated graphs.
I am currently a full professor at the University of Missouri-Kansas City in the Department of Biomedical and Health Informatics. I also do statistical consulting on a part-time basis.
I have been a prolific researcher, receiving support from 18 different grants, and writing over 100 peer-reviewed publications.
I started a website in 1998, writing about data analysis, research ethics, and evidence based medicine. I wrote about two or three pages every week and my site now has over 2,000 pages. It shows the value of persistence.
I love to talk about Statistics and have given many presentations at regional, national, and international conferences. These range from short 15-minute talks to day-long short courses.
Outline of the three day course
Day one: Numerical summaries and data visualization
Day two: Hypothesis testing and sampling
Day three: Statistical tests to compare treatment to a control and regression models
My goal: help you to become a better consumer of statistics
Day one topics
Numerical summaries
When should you present the mean versus the median
When should you present the range versus standard deviation
How should you display percentages
Why should you round liberally
Speaker notes
Today, you will learn about numerical summaries.
Day one topics (continued)
Data visualization
How should you display continuous data
Why is the normal bell-shaped curve important
How should you display categorical data
How do you illustrate trends and patterns
What are some common mistakes in the choice of colors
Counting and proportions
Counts are the most common statistic
Counts are error prone
Counts require a solid operational definition
Speaker notes
Let’s start with the simplest statistic of all: a simple count. This is probably the most common statistic produced.
But counts can be tricky. The counting process is error prone and requires a solid operational definition.
Student exercise
Count the number of occurrences of the letter “e”.
A quality control program is easiest
to implement from the top down.
Make sure that you understand the
the commitment of time and money
that is involved. Every workplace is
different, but think about allocating
10% of your time and 10% of the
time of all your employees to
quality control.
Speaker notes
Here’s an exercise I want you to do. Just count the number of occurrences of the letter “e”. Once you have your answer, type it in the chat box.
PAUSE HERE.
The numbers are different because of two things. First, it is easy to make mistakes. Did anyone notice the repetition of the word “the” at the end of the third line and the beginning of the fourth? It would be easy to miss that and count one fewer “e”.
What did you do with the first e in “Every”?
Did you count the e’s in the quote itself or also in the slide instructions and the slide header?
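Here is a minimal Python sketch of why the operational definition matters. It counts the letter in the quoted passage only, once with lowercase “e” alone and once with either case. The variable names are illustrative, and the choice to leave out the slide header and instructions is an assumption about what “the text” means.

```python
# The passage from the slide, line breaks preserved, including the repeated "the".
passage = (
    "A quality control program is easiest\n"
    "to implement from the top down.\n"
    "Make sure that you understand the\n"
    "the commitment of time and money\n"
    "that is involved. Every workplace is\n"
    "different, but think about allocating\n"
    "10% of your time and 10% of the\n"
    "time of all your employees to\n"
    "quality control."
)

# Definition 1: count only the lowercase letter "e".
count_lowercase_only = passage.count("e")

# Definition 2: count the letter in either case, so the "E" in "Every" is included.
count_either_case = passage.lower().count("e")

print("Lowercase e only:", count_lowercase_only)
print("Either case:     ", count_either_case)
```

The two definitions give different answers for the same text, and neither one is wrong. What matters is that everyone doing the counting uses the same one.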
Figure 1: Image of a haemocytometer
Speaker notes
This image is taken from the WHO laboratory manual for the examination and processing of human semen, published in 2021. It shows a haemocytometer, an instrument used for counting the number of cells. To get a proper count, you need to include any cells inside the four by four grid of large squares in the middle of this micrograph. But what does “inside” mean? Should you count only those cells entirely inside the four by four grid? Or should you include cells that are partially inside the grid?
One rule is to count cells if the head of the sperm cell touches the top or right side of a square, but not if it touches the bottom or left side of the square. And don’t count a sperm cell if only the tail is inside the square.
That’s not the only way you can do this, but just make sure that whatever convention you use for deciding “inside” versus “outside” is consistent across your laboratory.
Figure 2: Titanic data: counts of survival by gender
Speaker notes
Here is some count data from an interesting data set. It shows who survived and who did not on the passenger ship, Titanic.
The Titanic was an enormous ship. It was bigger than any passenger ship ever built at the time. It was so large that they thought it was unsinkable. But on its first voyage across the Atlantic Ocean, it struck an iceberg and sank.
They kept records on everyone on the ship: sex, age, and passenger class. There were 462 women on the ship. 308 of them survived, including Kate Winslet. The men did not fare as well. This was in a time when they really believed in the saying “Women and children first”. If this happened today, I’d push past all the ladies and the little kids and jump in that life boat first.
Among the 851 men, 709 died, including, sadly, Leonardo Di Caprio.
I’m making a reference to a popular movie, “Titanic” that was released in 1997. Has anyone seen that movie?
Anyway, you might want to examine mortality trends more closely by computing percentages. But there are three different ways you could compute these percentages.
Figure 3: Titanic data with column percentages
Speaker notes
Here are the percentages computed by dividing by the column totals. Divide the 308 surviving females by the total number of survivors, 450, to get 68%. Divide the 142 surviving males by 450 to get 32%. So those lifeboats were mostly, but not entirely, filled with women.
These are called column percents. They add up to 100% within each column: 18% + 82% = 100% and 68% + 32% = 100%.
Figure 4: Titanic data with row percentages
Speaker notes
You could also divide by the row totals. Divide the 308 surviving women by the total number of women, 462, to get a survival rate of 67%. Divide the 142 surviving men by the total number of men, 851, to get 17%.
17%! This shows how poorly the men fared on the Titanic. If you were female, you might have died, but more likely than not you did survive. For the men, not such good news. Most of them died. Only a small fraction survived.
These are called the row percentages. These percentages add up to 100% within each row: 33% + 67% = 100% and 83% + 17% = 100%.
Percentages divided by grand total
Figure 5: Titanic data with cell percentages
Speaker notes
You could also divide all the numbers by the grand total of 1,313. The 308 female survivors represented a bit less than 24% of all the passengers that set sail from England.
The 142 male survivors represented a bit less than 11% of all the passengers.
These are called the cell percentages. They add up to 100% across the entire table: 12% + 54% + 24% + 11% = 101%. Close enough!
Which makes the most sense? It depends on your perspective. If you want to test the hypothesis that male passengers on the Titanic had a much greater risk of dying, then the row percentages make the most sense.
But from the perspective of the Carpathia, the ship that rescued the survivors, the column percents make the most sense. They had to make room on their ship for 450 passengers, 68% of whom were female and 32% of whom were male. I bet that the lines for the women’s bathrooms on the Carpathia were really long.
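The three kinds of percentages can be sketched in a few lines of Python. This is only an illustration using the counts quoted in the speaker notes (154 and 308 women, 709 and 142 men). The dictionary layout and the helper name `pct` are my own choices, and the results are rounded to one decimal place rather than to whole numbers.

```python
# Counts from the Titanic table: rows are sex, columns are survival (No, Yes).
counts = {
    "Female": {"No": 154, "Yes": 308},
    "Male": {"No": 709, "Yes": 142},
}
outcomes = ("No", "Yes")


def pct(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


row_totals = {sex: sum(cells.values()) for sex, cells in counts.items()}
col_totals = {out: sum(counts[sex][out] for sex in counts) for out in outcomes}
grand_total = sum(row_totals.values())

# Row percentages: divide each cell by the total for its row (sex).
row_pct = {sex: {out: pct(counts[sex][out], row_totals[sex]) for out in outcomes}
           for sex in counts}

# Column percentages: divide each cell by the total for its column (outcome).
col_pct = {sex: {out: pct(counts[sex][out], col_totals[out]) for out in outcomes}
           for sex in counts}

# Cell percentages: divide each cell by the grand total.
cell_pct = {sex: {out: pct(counts[sex][out], grand_total) for out in outcomes}
            for sex in counts}

for label, table in (("Row", row_pct), ("Column", col_pct), ("Cell", cell_pct)):
    print(label, "percentages:", table)
```

The same four counts produce three different tables, so always say which denominator you used.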
My recommendations
Treatment or exposure as rows
Outcome as columns
Usually report row percentages
Female survival rate: 67%
Male survival rate: 17%
But sometimes column percentages
Survivors: 68% female, 32% male
Speaker notes
I have some general guidelines that I use. They don’t always work, but they work most of the time.
If you have a variable that represents a treatment or exposure, try using that as the rows of the table. If you have a variable that represents an outcome, try using that as the columns of the table. Sometimes, there are no clearly identified treatment variables and no clearly identified outcome variables. But try to categorize them this way, if you can.
With a table lined up with the treatments as the rows and the outcomes as the columns, calculate the row percentages.
In the Titanic data, survival is clearly an outcome. So arrange the table like I did with sex as the rows and survival as the columns and compare the two survival rates: a healthy 67% for females and a feeble 17% for males.
But sometimes you will find that the column percents make more sense. It does depend on what question you are trying to answer with the data.
Some rationale for these choices
My way
Survived
No Yes
Sex Female 33% (154) 67% (308)
Male 83% (709) 17% (142)
Not my way
Sex
Female Male
Survived No 33% (154) 83% (709)
Yes 67% (308) 17% (142)
Speaker notes
Now, I believe it is important to think carefully about which is your rows and which is your columns. Here’s the layout that I recommend on the left and the layout that I don’t recommend on the right. The key comparison is among survival rates, 67% for females and only 17% for males. When you orient my way with the treatment/exposure (Sex) as rows and the outcome (Survived) as the columns, the numbers 67% and 17% are very close to one another. In the alternate layout the numbers you are most interested in comparing are not as close together.
Now this is not an absolute rule. Sometimes I’ll switch things up. But about 90% of the time, I find that the table just looks better with the treatment or exposure as the rows and the outcome as the columns.
Break
What have you just learned?
What is coming next
Practice exercise
Calculation of the mean and median
On your own
Calculate row and column percentages for the following tables. Interpret your results.
Speaker notes
Now try to report both column and row percents for one of these two tables. Breakout room #1 work on the passenger class table and breakout room #2 work on the child data.
Put your percentages in a table using a word processing program or text editor so you can share your results with the group.
Be sure to interpret these numbers. Come back together again in about 10 minutes.
Figure 8: Cartoon image of Professor Mean
Speaker notes
Here’s a cartoon image of Professor Mean. I know this looks like it was drawn by a professional artist, but it was actually drawn by me. Really!
Professor Mean is my alter ego on the Internet. For those who don’t get the inside joke, I point out that Professor Mean is not just your average professor.
I will use the terms mean and average interchangeably throughout this talk.
Figure 9: Road with a median strip
Speaker notes
This is an image of a traffic median. This is a strip of land, typically raised from the road surface, that splits the road in half.
In Statistics, the median is the data value that splits the data in half. Half of the data is smaller than the median and half of the data is larger than the median.
Calculation of the mean and median
Mean
Add up all the values, divide by the sample size
Median
Sort the data
Select the middle value if n is odd
go halfway between the two middle values if n is even
Speaker notes
You already know how to compute the average. Add up all the values and divide by the sample size.
The median is also simple. Sort the data and choose the “middle” value. If n is odd, there is one value that is right in the middle. With five data values, the median is the third value of the sorted list. The first and second values are smaller and the fourth and fifth values are larger.
With an even number, there are two middle values. Go halfway between them. If you have eight data values, the midpoint between the fourth and fifth values splits the data in half. The first through fourth values in the sorted list are smaller and the fifth through eighth values are larger.
Formal mathematical definitions
Mean
\(\bar{X}=\frac{1}{n}\Sigma X_i\)
Median
Sorted values \(X_{[1]},X_{[2]},...,X_{[n]}\)
\(X_{[(n+1)/2]}\) if n is odd,
\((X_{[n/2]}+X_{[n/2+1]})/2\) if n is even
Speaker notes
Here are the mathematical formulas for the mean and median. I know some people hate formulas, but I love them. With a few symbols and Greek letters, you can express really deep and beautiful ideas. Well, these formulas aren’t all that deep.
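If formulas are not your thing, the same definitions can be written as a short Python sketch. The function names and the two small data sets are made up for illustration. Note that Python lists start counting at 0, so the position (n+1)/2 in the formula becomes index n // 2 in the code.

```python
def mean(values):
    """Add up all the values and divide by the sample size."""
    return sum(values) / len(values)


def median(values):
    """Sort the data, then take the middle value (n odd)
    or go halfway between the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n+1)/2 is 0-based index n // 2.
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data, chosen only to show the odd and even cases.
five_values = [3, 1, 4, 1, 5]
eight_values = [2, 7, 1, 8, 2, 8, 1, 8]

print(mean(five_values), median(five_values))
print(mean(eight_values), median(eight_values))
```

With five values the median is the third value of the sorted list. With eight values it sits halfway between the fourth and fifth.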
Bacteria before and after A/C upgrade
Room Before After Change
121 11.8 10.1 -1.7
125 7.1 3.8 -3.3
163 8.2 7.2 -1.0
218 10.1 10.5 0.4
233 10.8 8.3 -2.5
264 14 12 -2.0
324 14.6 12.1 -2.5
325 14 13.7 -0.3
Before remediation mean
11.8 + 7.1 + 8.2 + 10.1 + 10.8 + 14 + 14.6 + 14 = 90.6
90.6 / 8 = 11.325
Round to 11.3
Speaker notes
Here’s the data for bacterial counts before remediation. If you add the eight values up, you get 90.6. Divide this by eight to get 11.325. Always round liberally when you are talking about the mean.
After remediation mean
10.1 + 3.8 + 7.2 + 10.5 + 8.3 + 12 + 12.1 + 13.7 = 77.7
77.7 / 8 = 9.7125
Round to 9.7
Before remediation median (1/4)
121 11.8
125 7.1
163 8.2
218 10.1
233 10.8
264 14.0
324 14.6
325 14.0
Speaker notes
Here is the data for bacteria counts before remediation. Notice that the data is arranged by room number.
Before remediation median (2/4)
125 7.1
163 8.2
218 10.1
233 10.8
121 11.8
264 14.0
325 14.0
324 14.6
Speaker notes
The first thing you do is sort the data from the lowest bacteria count to the highest bacteria count.
The data was arranged by room number, but now it is arranged by bacterial count.
Before remediation median (3/4)
125 7.1
163 8.2
218 10.1
233 10.8 10.8
121 11.8 11.8
264 14.0
325 14.0
324 14.6
Speaker notes
Then pick out the middle value. If you have an even number of data points, there will be two middle values.
In this data set, the two middle values are the fourth and fifth values in the sorted list of eight.
Before remediation median (4/4)
125 7.1
163 8.2
218 10.1
233 10.8 10.8
(10.8 + 11.8) / 2 = 11.3
121 11.8 11.8
264 14.0
325 14.0
324 14.6
Speaker notes
If there are two middle values, just average them.
After remediation median (1/4)
121 10.1
125 3.8
163 7.2
218 10.5
233 8.3
264 12.0
324 12.1
325 13.7
After remediation median (2/4)
125 3.8
163 7.2
233 8.3
121 10.1
218 10.5
264 12.0
324 12.1
325 13.7
Speaker notes
Just like before, you sort the data.
After remediation median (3/4)
125 3.8
163 7.2
233 8.3
121 10.1 10.1
218 10.5 10.5
264 12.0
324 12.1
325 13.7
Speaker notes
Then pick out the middle value. Here again, there are two middle values.
After remediation median (4/4)
125 3.8
163 7.2
233 8.3
121 10.1 10.1
(10.1 + 10.5) / 2 = 10.3
218 10.5 10.5
264 12.0
324 12.1
325 13.7
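You can check the hand calculations on these slides with Python's built-in `statistics` module. This is a sketch rather than part of the original course materials. The data are the before and after bacteria values from the table, and the variable names are my own.

```python
import statistics

# Bacteria values by room (121, 125, 163, 218, 233, 264, 324, 325).
before = [11.8, 7.1, 8.2, 10.1, 10.8, 14.0, 14.6, 14.0]
after = [10.1, 3.8, 7.2, 10.5, 8.3, 12.0, 12.1, 13.7]

# statistics.median sorts the data and, for an even sample size,
# averages the two middle values, just as on the slides.
summary = {
    "before": (statistics.mean(before), statistics.median(before)),
    "after": (statistics.mean(after), statistics.median(after)),
}

for label, (avg, med) in summary.items():
    # Round liberally: one decimal place matches the precision of the data.
    print(f"{label}: mean = {avg:.1f}, median = {med:.1f}")
```

For these data the mean and the median land close to each other, which is the situation where either summary is fine.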
Choosing between the mean and median
When do you use the mean?
When totals are important
When do you use the median?
When outliers/skewness might distort your conclusions
Often, either is fine
Break
What have you just learned?
Calculation of the mean and median
What is coming next
Criticisms of the mean and median
Criticisms of the mean and median
Are you combining apples and onions?
Are you ignoring minorities?
There’s a wonderful cartoon by Dana Fradon that appeared in The New Yorker in 1976. The cartoon shows a road going into town and the sign by the side of the road reads “Hillsdale, Founded 1802, Altitude 600, Population 3,700. Total 6,122.” You can’t add these things together.
It’s similar for means. There was a dataset showing housing prices for homes in Boston and none of the analyses seemed to make sense. The problem in Boston is that a small number of the houses had prices that were out of sync with the other homes. These were historical houses, such as Paul Revere’s house.
When you are averaging numbers, maybe it’s okay to have a few oranges in with the apples. A mix of apples and oranges is just fruit salad. You shouldn’t have a problem with that.
When it becomes a problem is when the data are so diverse that it becomes a mix of apples and onions. There are lots of great recipes that mix apples and oranges, but none that mix apples and onions.
The other problem is that an average may be a reasonable number to represent the majority of patients in your sample, but it may mask some important trends that appear in a minority.
This is a big problem in a larger context than just the mean or median. There are some very fancy high tech prediction models that work very well for most people and the statistics like the mean and median back this up quite nicely. But the prediction models perform terribly for minority groups.
Use of the mean for ordinal data
Figure 10: Excerpt from Gould 1985 publication
Speaker notes
Stephen Jay Gould was a famous Evolutionary Biologist. He was a prolific writer with 20 books and 300 essays. Much of his writing was for academic researchers, but just as much was for the general public.
One of his most famous essays was “The Median Isn’t the Message”. The title is a take-off of a quote by Marshall McLuhan, “The medium is the message” which itself has an interesting history that you should investigate on your own.
The Gould essay was written in 1985 for Discover Magazine. It has been reprinted many times, and you can easily find the full text with a simple Google search.
The image shown here is taken from phoenix5.org, an informational site for patients with prostate cancer.
Bridge 2001, PMID: 11405531
Figure 11: Bridge and McKenzie 2001
Bridge 2001, PMID: 11405531 (continued)
The measurement of airway resistance by the interrupter technique (Rint) needs standardization. Should measurements be made during the expiratory or inspiratory phase of tidal breathing? In reported studies, the measurement of Rint has been calculated as the median or mean of a small number of values, is there an important difference?
Bridge 2001, PMID: 11405531 (continued)
In the present data the mean of a set of values contributing to a measurement was not significantly different from the median. However, the use of the median has been recommended since it is less affected by possible outlying values such as might be included by fully automated equipment.
Tosato 2021, PMID: 34352201
Figure 12: Tosato et al 2021
Tosato 2021, PMID: 34352201 (continued)
Symptom persistence weeks after laboratory-confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) clearance is a relatively common long-term complication of Coronavirus disease 2019 (COVID-19). Little is known about this phenomenon in older adults. The present study aimed at determining the prevalence of persistent symptoms among older COVID-19 survivors and identifying symptom patterns.
Tosato 2021, PMID: 34352201 (continued)
The mean age was 73.1 ± 6.2 years (median 72, interquartile range 27), and 63 (38.4%) were women. The average time elapsed from hospital discharge was 76.8 ± 20.3 days (range 25-109 days).
Ielapi 2021, PMID: 34968328
Figure 13: Ielapi et al 2021
Ielapi 2021, PMID: 34968328 (continued)
Background. Insomnia is one of the major health problems related with a decrease in quality of life (QOL) and also in poor functioning in night-shift nurses, that also may negatively affect patients’ care. The aim of this study is to evaluate the prevalence of insomnia in night shift nurses.
Ielapi 2021, PMID: 34968328 (continued)
Excerpt from Table 1.
Data reported as mean ± standard deviation or median [Q1-Q3]
Overall (n = 2′355)
Age, years 40.4 ± 10.3
Months of work 168 [72–300]
Night shifts per month, number 6.3 ± 1.4
Time to reach workplace, minutes 45 [45–65]
Rest time, minutes 180 [4–240]
Rest in the afternoon, minutes 30 [0–120]
Number of coffees, mean 2.5 ± 1.5
Number of coffees during night shift, mean 1.4 ± 1.1
Chen 2019, PMID: 31806195
Figure 14: Chen et al 2019
Chen 2019, PMID: 31806195 (continued)
Background: The prices of newly approved cancer drugs have risen over the past decades. A key policy question is whether the clinical gains offered by these drugs in treating specific cancer indications justify the price increases.
Chen 2019, PMID: 31806195 (continued)
Results: We found that between 1995 and 2012, price increases outstripped median survival gains, a finding consistent with previous literature. Nevertheless, price per mean life-year gained increased at a considerably slower rate, suggesting that new drugs have been more effective in achieving longer-term survival. Between 2013 and 2017, price increases reflected equally large gains in median and mean survival, resulting in a flat profile for benefit-adjusted launch prices in recent years.
Percentiles
Figure 15: Illustration of the 75th percentile
Speaker notes
I want to mention percentiles briefly. A percentile is a value that splits the data so that a certain percentage is smaller and a certain percentage is larger.
The 75th percentile, for example, will be above 75% of the data and below 25% of the data. This graph illustrates the 75th percentile for some arbitrary data. The gray bars represent about 75% of the data and the white bars represent about 25% of the data.
I use a few weasel words like “roughly” and “about” because you can’t always get a perfect split. But you can usually come close.
Computing percentiles
Many formulas
Differences are not worth fighting over
My preference (pth quantile)
Sort the data
Calculate p*(n+1)
Is it a whole number?
Yes: Select that value, otherwise
No: Go halfway between
Special cases: p(n+1) < 1 or > n
Speaker notes
There are close to a dozen different ways to compute a percentile, but the differences between the values selected are small and not worth fussing about.
Here is my preference for choosing the pth quantile (remember that for quantiles, you range between 0 and 1, not between 0 and 100).
Calculate the quantity p*(n+1). If that value is a whole number, great! You just select that value. If it is a fractional value, round up and down and go halfway between.
Once in a while, you’ll get an extreme case, where p(n+1) is less than 1 or greater than n. Just use a bit of common sense.
If you have nine values and p(n+1) is 9.2, you can’t go halfway between the 9th and 10th observations. There is no 10th observation. So just choose the 9th or largest value.
Likewise if p(n+1) is 0.8, you can’t go halfway between the zeroth and first observation. There is no zeroth observation. Just choose the first or smallest value.
Some examples of percentile calculations
Example for n=39
For 5th percentile, p(n+1)=2 -> 2nd smallest value
For 4th percentile, p(n+1)=1.6 -> halfway between two smallest values
For 2nd percentile, p(n+1)=0.8 -> smallest value
Speaker notes
Suppose you have 39 observations. For the 5th percentile or the 0.05 quantile, p(n+1) equals 2. Lucky you. The second smallest observation is the 5th percentile. For the 4th percentile or the 0.04 quantile, you get p(n+1) equal to 1.6. Go halfway between 1, the smallest value, and 2, the second smallest value.
The 2nd percentile represents one of the special cases. You calculate p(n+1) and get 0.8. You can’t go halfway between 0 and 1, so just choose the smallest value.
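The p(n+1) rule can be sketched as a small Python function. This is one way to code the preference described above, not a standard library routine. The function name `quantile` and the stand-in data (the whole numbers 1 through 39) are assumptions for illustration. The rounding of p(n+1) is a guard I added against floating point noise.

```python
import math


def quantile(values, p):
    """pth quantile (p between 0 and 1) using the p*(n+1) rule:
    a whole number selects that value, a fraction goes halfway
    between the two neighbors, and out-of-range positions fall
    back to the smallest or largest value."""
    ordered = sorted(values)
    n = len(ordered)
    # Round to avoid results like 2.0000000000000004 from binary arithmetic.
    position = round(p * (n + 1), 9)
    if position < 1:
        return ordered[0]   # no zeroth observation, so use the smallest
    if position > n:
        return ordered[-1]  # no (n+1)th observation, so use the largest
    lower = math.floor(position)
    upper = math.ceil(position)
    # Positions are 1-based; list indices are 0-based.
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


# Hypothetical data: 39 observations that happen to equal their own rank.
data = list(range(1, 40))

print(quantile(data, 0.05))  # position 2: the 2nd smallest value
print(quantile(data, 0.04))  # position 1.6: halfway between the two smallest
print(quantile(data, 0.02))  # position 0.8: the smallest value
```

When the position is a whole number, the floor and ceiling are the same, so "halfway between" simply returns that one value.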
Some terminology
Percentile: goes from 0% to 100%
Quantile: goes from 0.0 to 1.0
90th percentile = 0.9 quantile
Quartiles: 25th, 50th, and 75th percentiles
Lower quartile: 25th percentile
Upper quartile: 75th percentile
Speaker notes
A percentile always refers to a percentage. So it has to be between 0% and 100%. Sometimes, you may see references to a quantile. A quantile is a percentile, but is expressed as a proportion rather than a percent. A quantile goes from 0.0 to 1.0. The 25th percentile and the 0.25 quantile are the same thing.
You might see the term “quartiles”. These are the 25th, 50th, and 75th percentiles. These three values split the data into quarters.
If you see “lower quartile”, it means the 25th percentile. Likewise, “upper quartile” means the 75th percentile.
Let me try to be careful about terminology here. But, sometimes I will mess up and use “percentile” when I mean “quantile”.
Before remediation upper quartile (1/4)
121 11.8
125 7.1
163 8.2
218 10.1
233 10.8
264 14.0
324 14.6
325 14.0
Speaker notes
Here is the data for bacteria counts before remediation. Let’s calculate the upper quartile, also known as the 0.75 quantile or the 75th percentile.
Before remediation upper quartile (2/4)
125 7.1
163 8.2
218 10.1
233 10.8
121 11.8
264 14.0
325 14.0
324 14.6
Speaker notes
Just like before, you sort the data.
Before remediation upper quartile (3/4)
125 7.1
163 8.2
218 10.1
233 10.8
121 11.8
264 14.0 14
325 14.0 14
324 14.6
Speaker notes
With n=8, you get p(n+1) = 6.75. So pick out the sixth and seventh values.
Before remediation upper quartile (4/4)
125 7.1
163 8.2
218 10.1
233 10.8
121 11.8
264 14.0 14
(14 + 14) / 2 = 14
325 14.0 14
324 14.6
After remediation upper quartile (1/4)
121 10.1
125 3.8
163 7.2
218 10.5
233 8.3
264 12.0
324 12.1
325 13.7
After remediation upper quartile (2/4)
125 3.8
163 7.2
233 8.3
121 10.1
218 10.5
264 12.0
324 12.1
325 13.7
Speaker notes
Just like before, you sort the data.
After remediation upper quartile (3/4)
125 3.8
163 7.2
233 8.3
121 10.1
218 10.5
264 12.0 12
324 12.1 12.1
325 13.7
After remediation upper quartile (4/4)
125 3.8
163 7.2
233 8.3
121 10.1
218 10.5
264 12.0 12
(12 + 12.1) / 2 = 12.05
324 12.1 12.1
325 13.7
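To see how little the choice of formula matters, here is a sketch that computes the upper quartile of the same data with two of the methods in Python's `statistics.quantiles` function. Both methods interpolate between neighboring values rather than going exactly halfway, so they are not the rule used on these slides. The variable names are my own.

```python
import statistics

before = [11.8, 7.1, 8.2, 10.1, 10.8, 14.0, 14.6, 14.0]
after = [10.1, 3.8, 7.2, 10.5, 8.3, 12.0, 12.1, 13.7]

# Values worked out by hand on the slides with the "go halfway between" rule.
slide_values = {"before": 14.0, "after": 12.05}

upper_quartiles = {}
for label, data in (("before", before), ("after", after)):
    for method in ("exclusive", "inclusive"):
        # quantiles(..., n=4) returns the three quartiles; index 2 is the upper one.
        upper_quartiles[(label, method)] = statistics.quantiles(
            data, n=4, method=method
        )[2]

for (label, method), value in upper_quartiles.items():
    print(f"{label:6s} {method:9s} {value:.3f}  (slides: {slide_values[label]})")
```

For the before data, all approaches agree because the sixth and seventh sorted values are both 14. For the after data, the answers differ only in the second decimal place, which is not worth fighting over.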
When you should use percentiles
Characterize variation
Exposure issues
Not enough to control median exposure level
Quantify extremes
What does “upper class” mean?
Quality control
Almost all products must meet a minimum standard
Speaker notes
There are many reasons why you might be interested in percentiles rather than the mean or median. Actually, the median is a percentile, the 50th percentile, but what I mean is percentiles other than 50%.
One important use of percentiles is looking at the middle 50% of the data. This is the data between the lower quartile (25th percentile) and the upper quartile (75th percentile). Is the middle 50% of the data bunched tightly together or spread widely apart?
Percentiles are also important in the study of exposures. If you work in an environment where the median worker has a safe level of exposure, you could easily end up with 20%, 30% or more of the workers dying from unsafe exposures. It is important to ensure that not just the median, but a very high percentile like the 99th percentile of exposure levels is at a safe level.
Percentiles also help to define extreme groups. You can, for example, define the term upper class as anyone earning more than the 90th percentile of income.
Percentiles also can help with quality control. If you make a claim about a product, you want that claim to hold not just at the median but for almost every unit you sell. You don’t sell 500 mg bottles of liquid Tylenol if your factory is churning out a median fill level of 500 mg. Half of your customers would be cheated. Instead you ensure that a low percentile, such as the 2nd percentile, of what comes off the factory floor is at least 500 mg, so that 98% of the bottles meet the claim. You lose a bit of money because most bottles contain more than 500 mg, but the cost of an irate customer is worth more than the cost of 50 overfilled bottles.
Standard deviation
\[S = \sqrt{\frac{1}{n-1}\Sigma(X_i-\bar{X})^2}\]
At least one alternative formula.
Speaker notes
The standard deviation is a commonly used measure of how spread out the data is. The formula is a bit messy, but if you look carefully at it, you will see that it is a measure of how far each individual value is from the overall mean.
Now, maybe you’ve seen or used a different formula. Don’t worry about it. In a short course like this, I won’t ask you to calculate anything as tedious as a standard deviation. Let the computer do all of the work.
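For anyone curious about what the computer is doing, here is a sketch that codes the formula directly and compares it with Python's built-in `statistics.stdev`, which uses the same n-1 divisor. The function name `sample_sd` is my own, and the data are the before remediation bacteria values from earlier in the talk.

```python
import math
import statistics

before = [11.8, 7.1, 8.2, 10.1, 10.8, 14.0, 14.6, 14.0]


def sample_sd(values):
    """Square root of the sum of squared deviations from the mean,
    divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    squared_deviations = [(x - xbar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


by_formula = sample_sd(before)
by_library = statistics.stdev(before)

print(f"Formula: {by_formula:.2f}")
print(f"Library: {by_library:.2f}")
```

Each squared deviation measures how far one room is from the overall mean, which is why a larger standard deviation means more spread out data.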
Why is variation important
Variation = Noise
Too much noise can hide signals
Variation = Heterogeneity
Too little heterogeneity, hard to generalize
Too much heterogeneity, mixing apples and oranges
Variation = Unpredictability
Too much unpredictability, hard to prepare for the future
Variation = Risk
Too much risk can create a financial burden
Speaker notes
I want to discuss measures of variation now. Variation gets at the heart and soul of clinical statistics. A large portion of statistical analysis involves characterizing variation.
Variation can be thought of as a measure of noise. In general, but not always, noise is bad. Consider measuring a patient’s glucose level, to see if you have early evidence of diabetes. Your glucose level varies a lot during the day based on whether you skipped breakfast or decided to get a mid-afternoon Snickers bar. Your glucose level is noisy. A high level might or might not mean trouble. A low value might or might not mean you are safe. The large standard deviation of your measures of blood glucose indicates noise.
That’s why you are asked to take an overnight fast before testing your blood glucose level. Controlling your diet by not eating anything after midnight provides a more consistent measure of blood glucose. It has a smaller standard deviation and a high or low value is more helpful in diagnosis.
Variation can also be thought of as a measure of heterogeneity. Heterogeneity is also bad sometimes, but there are times when you want a fair amount of heterogeneity. A research study that has a lot of variation is better at providing a complete picture of what a typical patient is. Outcomes that are consistent in the presence of demographic heterogeneity give you more confidence in generalizing the results of a research study. You have some assurance that the therapy is not restricted to helping a small segment of patients.
Too much heterogeneity, though, can mean that any summary measure is a mixture of apples and oranges. You have to find the right balance.
Variation can be equated to unpredictability. The number of beds needed in a hospital does vary, and this makes it difficult to staff properly. The more variation in beds needed, the more headaches you have.
Variation can also be equated to risk. If you invest in a new drug, paying millions or even billions of dollars in testing, you are doing so with the hope that your investment will pay off. Unfortunately, the market for your drug is uncertain, and you might end up with no market at all if your clinical trials fail to convince the FDA. There is variation in the return on your investment, and the more variation there is, the more risky your development plans are.
Should you try to minimize variation?
Yes, for early studies
Easier to detect signals
Proof of concept trials
No, for later studies
Easier to generalize results
Pragmatic trials
Speaker notes
It is a bit of a generalization, but most researchers try to avoid variation in early studies. By early studies, I mean studies of therapies that have not yet been extensively tested in a broad range of settings. Less variation means that there is a greater chance to detect signals. You remove variation by using very strict entry criteria on who can get into the study. You remove variation by tightly controlling what the patient is allowed to do (e.g., no concomitant medications). You remove variation by tightly standardizing the delivery of the intervention and the assessment of the outcome. You reduce variation by removing patients who deviate from the research protocol requirements.
These are known as proof of concept trials. If a new therapy cannot succeed even under the tight controls, there is no point in studying it further. But success in a tightly controlled environment does not guarantee success in the real world.
If you are planning a trial that comes after many similar trials, you actually may want to encourage variation. Broaden the inclusion criteria so that the patients in the trial look no different than the patients you see every day in your clinic.
Standard deviation
\[S = \sqrt{\frac{1}{n-1}\Sigma(X_i-\bar{X})^2}\]
At least one alternative formula exists.
Speaker notes
The standard deviation is a commonly used measure of how spread out the data is. The formula is a bit messy, but if you look carefully at it, you will see that it is a measure of how far each individual value is from the overall mean.
Now, maybe you’ve seen or used a different formula. Don’t worry about it. In a short course like this, I won’t ask you to calculate anything as tedious as a standard deviation. Let the computer do all of the work.
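If you are curious what "letting the computer do the work" looks like, here is a minimal sketch in Python. It applies the n-1 formula from the slide step by step and then compares the result to Python's built-in function. The glucose readings are made up purely for illustration, and the function name sample_sd is my own choice.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation using the n-1 formula from the slide."""
    n = len(values)
    mean = sum(values) / n
    # How far is each individual value from the overall mean?
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

# Hypothetical fasting glucose readings (mg/dL), invented for this example
glucose = [92, 88, 95, 101, 90, 97]

print(sample_sd(glucose))
print(statistics.stdev(glucose))  # built-in version, also divides by n-1
```

The two printed values should agree, which is the point: the formula is tedious by hand but a single function call on the computer.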
The bell shaped curve
Does your variation follow a bell shaped curve?
Values in the middle are most common
Frequencies taper off away from the center
Symmetry on either side
A bell shaped curve = better characterization of variation
Speaker notes
Much variation in the real world follows a bell shaped curve, alternately called a normal distribution. You can assess whether you have a bell shaped curve using a histogram. Look for values in the middle being most common. The frequencies should taper off slowly as you move away from the middle. The histogram should have symmetry. The left side of the histogram should be roughly equivalent to the right side of the histogram.
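Here is a rough sketch of that histogram check in Python, using only the standard library. The data are simulated from a normal distribution (mean 100, standard deviation 15, both arbitrary choices), and text_histogram is a hypothetical helper written for this example, not a library function. It assumes the values are not all identical.

```python
import random

def text_histogram(values, n_bins=9):
    """Count values in equal-width bins and print one row of stars per bin."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The largest value would land just past the last bin, so clamp it
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    for i, count in enumerate(counts):
        left_edge = low + i * width
        # One star for every 10 observations in the bin
        print(f"{left_edge:7.1f} | {'*' * (count // 10)}")
    return counts

random.seed(1)  # fixed seed so the picture is the same on every run
simulated = [random.gauss(100, 15) for _ in range(1000)]
counts = text_histogram(simulated)
```

When you run it, look for the three features listed on the slide: the longest rows in the middle, rows that shrink toward both ends, and rough symmetry between the top and bottom halves of the printout.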
Not a bell shaped curve (1/4)
Figure 16: Bimodal histogram
Speaker notes
Here’s a histogram that shows a bimodal distribution. The frequencies are not highest in the center of the data. This is not a bell shaped curve.
Not a bell shaped curve (2/4)
Figure 17: Skewed histogram
Not a bell shaped curve (3/4)
Figure 18: Uniform histogram
Speaker notes
Here’s a histogram that shows a symmetric distribution, but the frequencies do not taper off as you move away from the center. This is not a bell shaped curve.
Not a bell shaped curve (4/4)
Figure 19: Heavy-tailed histogram
Speaker notes
Here’s a histogram that shows a symmetric distribution. The frequencies taper off at first, but then flatten out. This is called a heavy tailed distribution and it tends to produce outliers, extreme values, on both sides. This is not a bell shaped curve.
A bell shaped curve (finally!)
Figure 20: Bell-shaped histogram
Speaker notes
Here’s a histogram that shows a symmetric distribution, with the most frequent values in the center and frequencies that taper off on either side. This is a bell shaped curve.
Plus or minus one standard deviation
Figure 21: Percentage within one s
Speaker notes
This shows the bell shaped curve with the data within one standard deviation of the mean highlighted in gray. Roughly 68% of the data lies within one standard deviation of the mean. This is only true if the variation follows a bell shaped curve.
Plus or minus two standard deviations
Figure 22: Percentage within two s
Speaker notes
This shows the bell shaped curve with the data within two standard deviations of the mean highlighted in gray. Roughly 95% of the data lies within two standard deviations of the mean. This is only true if the variation follows a bell shaped curve.
Plus or minus three standard deviations
Figure 23: Percentage within three s
Speaker notes
This shows the bell shaped curve with the data within three standard deviations of the mean highlighted in gray. Roughly 99.7% of the data lies within three standard deviations of the mean. This is only true if the variation follows a bell shaped curve.
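You do not have to take these percentages on faith. For a normal distribution, the proportion within k standard deviations of the mean can be computed from the error function, which is built into Python's math module. This is a small sketch; the function name share_within is my own.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} SD: {100 * share_within(k):.1f}%")
# Within 1 SD: 68.3%
# Within 2 SD: 95.4%
# Within 3 SD: 99.7%
```

These are the values that get rounded to 68%, 95%, and 99.7%. They apply only to a bell shaped curve; skewed or heavy tailed data can give quite different percentages.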
Lin 2022, PMID: 36126916
Figure 24: Lin et al 2022
Lin et al 2022 patient ages
Figure 25: Excerpt from Table 1 of Lin et al 2022: ages
Lin et al 2022 Charlson Comorbidity Index
Figure 26: Excerpt from Table 1 of Lin et al 2022: CCI
Lin et al 2022 PHQ-2 scores
Figure 27: Excerpt from Table 1 of Lin et al 2022: PHQ-2
Which visualization to choose?
How should you display continuous data?
How should you display categorical data?
How do you illustrate trends and patterns?
What are some common mistakes in the choice of colors?
http://www.pmean.com/posts/misuse-of-gradient/
http://blog.pmean.com/rainbows/
Primary colors
Figure 28: Color combinations
Color combinations: yellow
Color combinations: magenta
Figure 29: Green plus blue
The color cube
Figure 30: Illustration of the color cube
The color cylinder
Figure 31: Color cylinder
Rainbow
Harsh contrasts
Lighter rainbow
Darker rainbow
Gentler contrasts
Equally spaced hues
Figure 32: Color choices for nominal data
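One way to get equally spaced hues, and lighter or darker versions of them, is to step around the color cylinder at a fixed lightness. The sketch below uses Python's standard colorsys module. The function name equally_spaced_hues and the particular lightness values are assumptions chosen for illustration, not the settings used to draw the figures in these slides.

```python
import colorsys

def equally_spaced_hues(n, lightness=0.5, saturation=1.0):
    """Return n hex colors whose hues are evenly spaced around the color wheel."""
    colors = []
    for i in range(n):
        hue = i / n  # fraction of the way around the wheel, from 0 up to (not including) 1
        # Note the argument order: hue, lightness, saturation
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors

print(equally_spaced_hues(3))                  # pure red, green, and blue
print(equally_spaced_hues(6, lightness=0.75))  # a lighter rainbow
print(equally_spaced_hues(6, lightness=0.3))   # a darker rainbow
```

Raising the lightness softens the contrasts and lowering it darkens the palette, while the spacing between hues stays the same.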
Figure 33: Illustration of the rainbow gradient
Figure 34: Clothing mistake: using too many colors
Advertisement with a single red umbrella
Speaker notes
Graphic designers have known for quite a while that a restrained use of colors can be very effective. Here is an image from a YouTube video clip:
The Travelers - Look under the Umberella commercial (1986). Retrieved 2019-09-07 from https://www.youtube.com/watch?v=3zQX66jd_c0
The single red umbrella in a sea of black umbrellas stands out. Your eye can’t help but follow this umbrella as it travels across the screen from left to right. It’s a very powerful image.
A small dollop of color in your visualizations can be far more effective than using a whole bunch of different colors.
Figure 35: Use of color to highlight a single individual
Speaker notes
Here is a second example, from the movie, Legally Blonde. In this scene, the main character, Elle Woods, played by Reese Witherspoon, shows her individuality by opening up a bright orange and white Macintosh computer. All the other students are using generic black laptops.
This has practical implications for data visualization.
Figure 36: How many “5’s” are in this figure?
Speaker notes
Here’s a simple exercise: count the number of “5’s” on this graph. Don’t include the “5” that appears in the caption.
When you have an answer, type it in the chat box.
[Pause here]
Now I did try to help by using a different color for each number.
Figure 37: Repeat question. How many “5’s” are in this figure?
Speaker notes
Okay, now repeat this exercise. How many “5’s” do you count? Notice how much faster it is when there are two colors instead of nine.
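The two-color idea is easy to put into practice. Here is a minimal sketch in Python that assigns one highlight color to the value you care about and a neutral color to everything else. The function name, the color names, and the list of digits are all hypothetical; most plotting packages will accept a list of colors like this one.

```python
def highlight_colors(labels, target, highlight="red", background="lightgray"):
    """Give the target value one standout color and everything else a neutral one."""
    return [highlight if label == target else background for label in labels]

# Made-up digits, standing in for the numbers scattered across the figure
digits = [3, 5, 8, 5, 1, 9, 5, 2, 7, 4, 5, 6]
colors = highlight_colors(digits, target=5)

print(colors)
print(colors.count("red"))  # prints 4, the number of 5's in the list
```

The design choice is the same as the red umbrella: the fewer colors you use, the more the one that matters stands out.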